# This is a so-called "R chunk" where you can write R code.
date()
## [1] "Sun Nov 28 23:02:05 2021"
As a former computer science major, everything so far has seemed pretty easy. I have had some problems with DataCamp, which doesn't always accept correct R scripts. I expect to learn about using R for the statistical work in my thesis. Previously I have used SPSS, but it always seemed to be lacking on the programming side. At least for now, R feels more familiar as a language thanks to, again, my computer science background.
My Diary: https://mvannas.github.io/IODS-project/
My Repository: https://github.com/mvannas/IODS-project
FYI: SSH authentication is now required by GitHub. While installing the public key on GitHub was simple, I used the instructions at https://cyberhelp.sesync.org/faq/set-up-gitlab-ssh-key.html to configure RStudio correctly for the SSH key. Also, the first time you push a commit to GitHub and the push asks something, you need to answer 'yes' to install the host key.
test_data <- read.csv("data/work_data.csv")
str(test_data)
## 'data.frame': 166 obs. of 7 variables:
## $ gender : chr "F" "M" "F" "M" ...
## $ age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: int 37 31 25 35 37 38 35 29 38 21 ...
## $ points : int 25 12 24 10 22 21 21 31 24 26 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
The dataset was originally collected between 3.12.2014 and 10.1.2015. It is a combined survey with questions from ASSIST (Approaches and Study Skills Inventory for Students) and SATS (Survey of Attitudes Toward Statistics). A subset of this data is used in this exercise: it contains the participants' background of gender (binary) and age, and a combined mean score for the deep, strategic and surface approaches to learning. Our subset also contains each participant's total exam points and attitude towards statistics (from SATS). After removing cases with zero points, 166 cases remained, with 7 variables each.
library(GGally)
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
test_data %>% group_by(gender) %>% summarise_at(vars(age, points,deep,stra, surf, attitude), list(name = mean))
## # A tibble: 2 × 7
## gender age_name points_name deep_name stra_name surf_name attitude_name
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 F 24.9 22.3 3.66 3.20 2.83 29.9
## 2 M 26.8 23.5 3.72 2.96 2.70 34.4
p <- ggpairs(test_data, mapping = aes(col = gender), upper = list(continuous = wrap('cor', size = 3)), lower = list(combo = wrap("facethist", bins = 20)))
p
Here you can see the summary statistics and relationships between variables. There are more female than male participants, but there are no immediately observable differences between male and female scores or points. The average age of the participants is 24.9 (females) and 26.8 (males), with some outliers in both genders being clearly older.
The deep and surface approaches to learning seem to have a negative correlation with points, while the strategic approach has a positive correlation. Attitude also has a strong positive correlation with the overall points score.
model <- lm(points ~ attitude + stra + surf, data = test_data)
summary(model)
##
## Call:
## lm(formula = points ~ attitude + stra + surf, data = test_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.1550 -3.4346 0.5156 3.6401 10.8952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.01711 3.68375 2.991 0.00322 **
## attitude 0.33952 0.05741 5.913 1.93e-08 ***
## stra 0.85313 0.54159 1.575 0.11716
## surf -0.58607 0.80138 -0.731 0.46563
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared: 0.2074, Adjusted R-squared: 0.1927
## F-statistic: 14.13 on 3 and 162 DF, p-value: 3.156e-08
Based on the previous correlation estimates, I selected attitude, the strategic approach score and the surface approach score for a multiple regression analysis. The summary shows that attitude is statistically the most significant, even though its estimated coefficient is the smallest (0.33952). It should be noted that this score has not been averaged to the 0-5 scale like the deep, stra and surf scores. Two options exist: you can either scale the explanatory variables to the same range, or you can calculate standardized coefficient estimates, for which there are ready-made R libraries.
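As a quick sketch of the first option (not part of the original analysis, and using the built-in mtcars data so it runs standalone): standardizing the predictors rescales the coefficients without changing their t-values or p-values.

```r
# Sketch: scaling predictors makes coefficients comparable across predictors.
raw <- lm(mpg ~ wt + hp, data = mtcars)
std <- lm(mpg ~ scale(wt) + scale(hp), data = mtcars)
coef(raw)
coef(std)  # coefficients are now per one standard deviation of each predictor
# the t-values (and hence p-values) are unchanged by the rescaling:
all.equal(unname(summary(raw)$coefficients[-1, "t value"]),
          unname(summary(std)$coefficients[-1, "t value"]))
```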
library(QuantPsyc)
## Loading required package: boot
## Loading required package: MASS
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
##
## Attaching package: 'QuantPsyc'
## The following object is masked from 'package:base':
##
## norm
lm.beta(model)
## attitude stra surf
## 0.42039820 0.11170276 -0.05257769
The surf score has the smallest standardized coefficient estimate, so I remove it from the equation and refit the model.
model <- lm(points ~ attitude + stra, data = test_data)
summary(model)
##
## Call:
## lm(formula = points ~ attitude + stra, data = test_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.6436 -3.3113 0.5575 3.7928 10.9295
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.97290 2.39591 3.745 0.00025 ***
## attitude 0.34658 0.05652 6.132 6.31e-09 ***
## stra 0.91365 0.53447 1.709 0.08927 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.289 on 163 degrees of freedom
## Multiple R-squared: 0.2048, Adjusted R-squared: 0.1951
## F-statistic: 20.99 on 2 and 163 DF, p-value: 7.734e-09
Reducing the variables to just attitude and stra still leaves the stra score statistically non-significant (using p < 0.05 as the threshold for statistical significance). However, removing the surf score did increase the adjusted R-squared, which means this model is a better fit to the observed data.
model <- lm(points ~ attitude, data = test_data)
summary(model)
##
## Call:
## lm(formula = points ~ attitude, data = test_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.63715 1.83035 6.358 1.95e-09 ***
## attitude 0.35255 0.05674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
Running the model with just attitude results in a lower R-squared. Thus I would conclude that the previous model, with stra and attitude as the explanatory variables, is a better fit.
The multiple R-squared measures how much of the variation in the target the explanatory variables in the model explain together. Higher is better, up to 1: 0 means the explanatory variables do not explain any of the variation in the observed values, and 1 means they explain all of it. The adjusted R-squared takes into account how much an added explanatory variable would be expected to increase the value by chance, so it grows only if the added variable contributes something useful to the model beyond pure chance.
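To make the definition concrete, R-squared and its adjusted version can be computed by hand from the residuals; a sketch on the built-in mtcars data rather than the course data:

```r
# R-squared by hand: 1 - RSS/TSS, checked against summary().
fit <- lm(mpg ~ wt, data = mtcars)
rss <- sum(resid(fit)^2)                       # residual sum of squares
tss <- sum((mtcars$mpg - mean(mtcars$mpg))^2)  # total sum of squares
r2  <- 1 - rss / tss
all.equal(r2, summary(fit)$r.squared)
# adjusted R-squared penalizes each added predictor (here p = 1 predictor):
n <- nrow(mtcars); p <- 1
adj <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
all.equal(adj, summary(fit)$adj.r.squared)
```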
Our linear model makes some assumptions that the data must fulfill. Firstly, the model is linear, so the relationship between the explanatory variables and the target must be linear. It is also usually assumed that the errors are normally distributed. The validity of these assumptions can be explored by looking at the residual plots.
model <- lm(points ~ attitude + stra, data = test_data)
par(mfrow = c(2,2))
plot(model, which = c(1,2,5))
The Residuals vs Fitted plot is used to check the model for constant variance of errors. In the top left panel we see our residuals plotted against the fitted values. The points seem evenly scattered without any obvious pattern, so we conclude that our model, with our data, has a fairly constant error variance.
The normal Q-Q plot is used to check that the errors are normally distributed. The points should show fairly little deviation from the line to pass this check. Again, with our model and data, it seems that the errors are normally distributed.
The Residuals vs Leverage plot shows the residuals together with Cook's distance. Cook's distance estimates the influence of an observation on the parameters of the model; values over 1 are usually suggestive of undue influence. Our plot shows that while a few observations have slightly more influence than the rest, their Cook's distances are still very low, so no observation has undue power over the model.
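The same checks can also be done numerically; a sketch (again on mtcars, since the course data file is not bundled here) pairing the Q-Q plot with a Shapiro–Wilk test and reading the Cook's distances directly:

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)
# numeric complement to the Q-Q plot: test the residuals for normality
shapiro.test(resid(fit))
# Cook's distances: flag anything above the rule-of-thumb cutoff of 1
cd <- cooks.distance(fit)
max(cd)
which(cd > 1)  # empty if no observation has undue influence
```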
Chapter 3 is broken.
Our next exercise uses a ready-made dataset, the Boston dataset. It contains information on the city of Boston, with descriptive variables for the observed suburbs.
library(MASS)
library("ggplot2")
library("GGally")
library("corrplot")
## corrplot 0.90 loaded
library("tidyverse")
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ tibble 3.1.5 ✓ purrr 0.3.4
## ✓ tidyr 1.1.4 ✓ stringr 1.4.0
## ✓ readr 2.0.2 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
## x MASS::select() masks dplyr::select()
library("vtable")
## Loading required package: kableExtra
##
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
##
## group_rows
data("Boston")
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506 14
There are 506 observations with 14 different variables.
| Variable | Explanation |
|---|---|
| crim | per capita crime rate by town |
| zn | proportion of residential land zoned for lots over 25,000 sq.ft. |
| indus | proportion of non-retail business acres per town |
| chas | Charles River dummy variable (= 1 if tract bounds river; 0 otherwise) |
| nox | nitrogen oxides concentration (parts per 10 million) |
| rm | average number of rooms per dwelling |
| age | proportion of owner-occupied units built prior to 1940 |
| dis | weighted mean of distances to five Boston employment centres |
| rad | index of accessibility to radial highways |
| tax | full-value property-tax rate per $10,000 |
| ptratio | pupil-teacher ratio by town |
| black | 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town |
| lstat | lower status of the population (percent) |
| medv | median value of owner-occupied homes in $1000s |
Information on variables from https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/Boston.html
ggpairs(Boston, upper = list(continuous = wrap('cor', size = 2)))
ggcorr(cor(Boston), geom = "circle", nbreaks = 11)
sumtable(Boston)
| Variable | N | Mean | Std. Dev. | Min | Pctl. 25 | Pctl. 75 | Max |
|---|---|---|---|---|---|---|---|
| crim | 506 | 3.614 | 8.602 | 0.006 | 0.082 | 3.677 | 88.976 |
| zn | 506 | 11.364 | 23.322 | 0 | 0 | 12.5 | 100 |
| indus | 506 | 11.137 | 6.86 | 0.46 | 5.19 | 18.1 | 27.74 |
| chas | 506 | 0.069 | 0.254 | 0 | 0 | 0 | 1 |
| nox | 506 | 0.555 | 0.116 | 0.385 | 0.449 | 0.624 | 0.871 |
| rm | 506 | 6.285 | 0.703 | 3.561 | 5.886 | 6.624 | 8.78 |
| age | 506 | 68.575 | 28.149 | 2.9 | 45.025 | 94.075 | 100 |
| dis | 506 | 3.795 | 2.106 | 1.13 | 2.1 | 5.188 | 12.126 |
| rad | 506 | 9.549 | 8.707 | 1 | 4 | 24 | 24 |
| tax | 506 | 408.237 | 168.537 | 187 | 279 | 666 | 711 |
| ptratio | 506 | 18.456 | 2.165 | 12.6 | 17.4 | 20.2 | 22 |
| black | 506 | 356.674 | 91.295 | 0.32 | 375.378 | 396.225 | 396.9 |
| lstat | 506 | 12.653 | 7.141 | 1.73 | 6.95 | 16.955 | 37.97 |
| medv | 506 | 22.533 | 9.197 | 5 | 17.025 | 25 | 50 |
From the summary statistics we can observe that the variables have widely differing ranges: chas is a binary dummy with values 0 or 1, while black ranges roughly from 0 to 400. With the exception of chas, most of the variables seem to have fairly strong correlations with at least some of the other variables.
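For example, the correlations of the other variables with the crime rate can be pulled straight out of the correlation matrix (a small sketch, independent of the plots above):

```r
library(MASS)
data("Boston")
# correlations of each variable with the per capita crime rate,
# sorted from strongest positive to strongest negative
cr <- cor(Boston)[, "crim"]
round(sort(cr[names(cr) != "crim"], decreasing = TRUE), 2)
```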
bsca <- scale(Boston)
bsca <- as.data.frame(bsca)
sumtable(bsca, add.median = TRUE)
| Variable | N | Mean | Std. Dev. | Min | Pctl. 25 | Pctl. 50 | Pctl. 75 | Max |
|---|---|---|---|---|---|---|---|---|
| crim | 506 | 0 | 1 | -0.419 | -0.411 | -0.39 | 0.007 | 9.924 |
| zn | 506 | 0 | 1 | -0.487 | -0.487 | -0.487 | 0.049 | 3.8 |
| indus | 506 | 0 | 1 | -1.556 | -0.867 | -0.211 | 1.015 | 2.42 |
| chas | 506 | 0 | 1 | -0.272 | -0.272 | -0.272 | -0.272 | 3.665 |
| nox | 506 | 0 | 1 | -1.464 | -0.912 | -0.144 | 0.598 | 2.73 |
| rm | 506 | 0 | 1 | -3.876 | -0.568 | -0.108 | 0.482 | 3.552 |
| age | 506 | 0 | 1 | -2.333 | -0.837 | 0.317 | 0.906 | 1.116 |
| dis | 506 | 0 | 1 | -1.266 | -0.805 | -0.279 | 0.662 | 3.957 |
| rad | 506 | 0 | 1 | -0.982 | -0.637 | -0.522 | 1.66 | 1.66 |
| tax | 506 | 0 | 1 | -1.313 | -0.767 | -0.464 | 1.529 | 1.796 |
| ptratio | 506 | 0 | 1 | -2.705 | -0.488 | 0.275 | 0.806 | 1.637 |
| black | 506 | 0 | 1 | -3.903 | 0.205 | 0.381 | 0.433 | 0.441 |
| lstat | 506 | 0 | 1 | -1.53 | -0.799 | -0.181 | 0.602 | 3.545 |
| medv | 506 | 0 | 1 | -1.906 | -0.599 | -0.145 | 0.268 | 2.987 |
After using the R scale function we have now standardized the values: each variable has mean 0 and standard deviation 1. Looking at the standardized summary table, the spread seems fairly even; however, the crim variable seems to have a very heavy right tail, as its maximum standardized value is about three times larger than in any other variable.
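As a sanity check (a sketch, not part of the exercise), scale() with default settings is just subtracting each column's mean and dividing by its standard deviation:

```r
library(MASS)
data("Boston")
# manual standardization of one column, compared against scale()
manual <- (Boston$crim - mean(Boston$crim)) / sd(Boston$crim)
auto   <- scale(Boston)[, "crim"]
all.equal(manual, as.numeric(auto))
```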
# Create quantiles from the crim value as separate categories
crime <- cut(bsca$crim, breaks = quantile(bsca$crim), labels = c("low", "med_low", "med_high", "high"), include.lowest = TRUE)
bsca <- dplyr::select(bsca, -crim)
bsca <- data.frame(bsca, crime)
# Create test and training sets (80 % to train set)
ind <- sample(nrow(bsca), size = nrow(bsca) * 0.8)
train <- bsca[ind,]
test <- bsca[-ind,]
# Copy correct crime categories from the test set and remove them from it
correct_classes <- test$crime
test <- dplyr::select(test, -crime)
# Shamelessly copied from the DataCamp exercise
# linear discriminant analysis
lda.fit <- lda(crime ~ ., data = train)
# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2351485 0.2549505 0.2574257 0.2524752
##
## Group means:
## zn indus chas nox rm age
## low 0.92296626 -0.9009418 -0.10655643 -0.8644534 0.41735289 -0.8509330
## med_low -0.07949111 -0.2759888 -0.04298342 -0.5709646 -0.13455023 -0.3018332
## med_high -0.39571399 0.1480147 0.18195173 0.3988551 0.08386261 0.3806720
## high -0.48724019 1.0171096 -0.07933396 1.0547891 -0.40860826 0.8086769
## dis rad tax ptratio black lstat
## low 0.8853003 -0.6796430 -0.7280939 -0.47053909 0.37312954 -0.75325821
## med_low 0.4058627 -0.5380946 -0.4748703 -0.05838825 0.31399003 -0.09882069
## med_high -0.3713910 -0.3877604 -0.3008162 -0.22018450 0.05211565 0.03921177
## high -0.8683753 1.6382099 1.5141140 0.78087177 -0.92470369 0.88757270
## medv
## low 0.49224121
## med_low -0.02489075
## med_high 0.13872274
## high -0.67428113
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.08483102 0.735059303 -0.946528390
## indus -0.03057688 -0.175694994 0.516979849
## chas -0.10065380 -0.055712537 -0.005489093
## nox 0.40795329 -0.889560908 -1.334254603
## rm -0.10133878 -0.136008952 -0.130604301
## age 0.25934018 -0.209862908 0.082722146
## dis -0.09972813 -0.285641206 0.416202167
## rad 2.95005773 0.953348706 0.041864765
## tax 0.03553384 0.006594074 0.390011020
## ptratio 0.14302224 -0.109188558 -0.321861468
## black -0.14229325 0.023673885 0.150810414
## lstat 0.20030651 -0.249856741 0.366853588
## medv 0.18070509 -0.442741021 -0.130299854
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9473 0.0388 0.0138
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
# target classes as numeric
classes <- as.numeric(train$crime)
# plot the lda results
plot(lda.fit, dimen = 2,col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 2)
From the biplot, and quite easily from the coefficients of the linear discriminants in the output, we can see that the rad variable, i.e. the index of accessibility to radial highways, is strongly associated with high crime rates.
For the three lower categories of crime, it seems that areas with larger plot sizes have lower crime rates. The nitrogen oxide concentration seems to have a similar but opposite effect (higher concentration, more crime).
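The loadings can also be read straight from the fitted object's scaling matrix. A standalone sketch refitting on the full scaled data (so the numbers differ slightly from the training-set fit above):

```r
library(MASS)
data("Boston")
bsca <- as.data.frame(scale(Boston))
crime <- cut(bsca$crim, breaks = quantile(bsca$crim),
             labels = c("low", "med_low", "med_high", "high"),
             include.lowest = TRUE)
# fit LDA on all variables except the original crim
fit <- lda(crime ~ . - crim, data = cbind(bsca, crime))
# LD1 coefficients sorted by magnitude; rad dominates as noted above
head(sort(abs(fit$scaling[, "LD1"]), decreasing = TRUE), 3)
```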
# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)
# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 20 11 1 0
## med_low 4 17 2 0
## med_high 1 6 15 0
## high 0 0 0 25
The generated model seems to predict the higher crime rate areas best. The highest category forming a clearly separate cluster in the biplot most likely explains this; for the three other categories there is more variance in the predictions.
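One convenient summary of the cross-tabulation is the overall accuracy: the diagonal over the total. Recomputing it from the table printed above (note that the exact counts depend on the random train/test split):

```r
# confusion matrix copied from the cross-tabulation above
tab <- matrix(c(20, 11,  1,  0,
                 4, 17,  2,  0,
                 1,  6, 15,  0,
                 0,  0,  0, 25),
              nrow = 4, byrow = TRUE,
              dimnames = list(correct   = c("low", "med_low", "med_high", "high"),
                              predicted = c("low", "med_low", "med_high", "high")))
sum(diag(tab)) / sum(tab)  # overall accuracy, about 0.75
```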
# Need to redo the scaling for K-means to work (crim was replaced by the categorical crime above)
bsca <- scale(Boston)
dist_euclidian <- dist(bsca)
summary(dist_euclidian)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4625 4.8241 4.9111 6.1863 14.3970
kc <- 6
km <- kmeans(bsca, centers = 4)
# use the scaled data (bsca) here as well, not the raw Boston data
TWCSS <- sapply(1:kc, function(k){kmeans(bsca, k)$tot.withinss})
bsca <- as.data.frame(bsca)
pairs(bsca[1:5], col = km$cluster)
pairs(bsca[6:10], col = km$cluster)
pairs(bsca[11:14], col = km$cluster)
qplot(x = 1:kc, y = TWCSS, geom = 'line')
Looking at the paired plots, or at the more informative total within-cluster sum of squares (TWCSS) plot, we can see that clustering into two groups would capture most of the differences in the variables.
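Given the elbow at two clusters, the final model could be refit with centers = 2; a sketch (with a fixed seed, since the k-means starting points are random):

```r
library(MASS)
data("Boston")
set.seed(123)  # k-means is stochastic; fix the seed for reproducibility
bsca <- as.data.frame(scale(Boston))
# nstart = 20 reruns the algorithm from 20 random starts and keeps the best
km2 <- kmeans(bsca, centers = 2, nstart = 20)
table(km2$cluster)  # sizes of the two clusters
km2$tot.withinss    # total within-cluster sum of squares for k = 2
```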